
Improving ASR Output for Endangered Language Documentation

https://doi.org/10.21437/SLTU.2018-39

Jimerson, Robert; Simha, Kruthika; Ptucha, Ray; Prud'hommeaux, Emily (August 2018, The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages)

Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task.
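The abstract reports its results as reductions in word error rate (WER). As a minimal sketch of how that metric is conventionally computed, and not the authors' evaluation code, WER is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the placeholder tokens in the usage lines are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance / number of reference words.

    Illustrative helper (hypothetical name); tokens are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens only; one substitution and one deletion over four words.
baseline = word_error_rate("w1 w2 w3 w4", "w1 x w3")
# A hypothesis with a single substitution scores lower, i.e. better.
improved = word_error_rate("w1 w2 w3 w4", "w1 x w3 w4")
print(baseline, improved)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.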

